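The dataset contains 13 variables and 1,599 rows (observations). It was created in 2009 by Paulo Cortez et al. Goal: which chemical properties affect the quality of red wine?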
## [1] "/Users/soukeihime/Documents/My Projects/second_work_s"
## [1] 1599 13
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
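A minimal sketch of the inspection calls that produce output of this shape. The loading code is not shown in this report, so the data frame below is rebuilt by hand from the first ten observations printed by `str()` above, using only a few of the columns. The name `win_head` is an assumption made for illustration; on the full data the same calls would be made on `win`.

```r
# First ten observations, copied from the str() output above.
# `win_head` is an illustrative name; the report's full data frame is `win`.
win_head <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

dim(win_head)      # number of rows and columns
str(win_head)      # column types and leading values
summary(win_head)  # min, quartiles, median, mean and max per column
```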
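From a first look at the data, red wine quality appears to be related mainly to alcohol content. The analysis follows.
## Univariate analysis
(1) Histogram of quality ratings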
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
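Most of the red wines were scored 5 or 6. In other words, the expert tasters judged this batch to be mostly of medium quality, with no very high and no very low scores.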
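The plotting code for this histogram is not shown, so the following is only a sketch of one way to draw it with base R. It uses the ten `quality` values printed by `str()` above; fixing the factor levels to the 0 to 10 rating scale is an assumption made so that unused scores still appear as empty bars.

```r
# Ten quality scores copied from the str() output above (not the full data).
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Count wines per score; fixed levels keep unused scores on the axis.
counts <- table(factor(quality, levels = 0:10))

# quality is a discrete score, so a bar chart of counts acts as its histogram.
barplot(counts, xlab = "quality", ylab = "number of wines")
```

On the full data frame the same two calls would be applied to `win$quality`.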
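(2) Histogram of fixed acidity (fixed.acidity)
Fixed acidity is roughly normal with a slight right skew; most wines sit at about 7 or 8, and the remaining values are more scattered.
(3) Histogram of volatile acidity (volatile.acidity)
Volatile acidity is spread fairly evenly, with a small number of wines at higher values.
(4) Histogram of citric acid
Citric acid is spread fairly evenly, with a small number of wines at higher values.
(5) Histogram of residual sugar (residual.sugar)
The histogram is right-skewed, with most values concentrated around 2.
(6) Histogram of chlorides
The histogram is right-skewed, with most values concentrated at the low end; most wines contain very little chloride.
(7) Histogram of free sulfur dioxide
The histogram is right-skewed with a steady gradient: the number of wines falls as free sulfur dioxide increases.
(8) Histogram of total sulfur dioxide
Total sulfur dioxide is slightly right-skewed and more evenly spread than free sulfur dioxide.
(9) Histogram of density
Density is approximately normally distributed.
(10) Histogram of pH
pH is approximately normally distributed.
(11) Histogram of sulphates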
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
The sulphates histogram is right-skewed, with most values near 0.6.
(12) Histogram of alcohol content
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
The histogram is right-skewed: most wines have an alcohol content of around 9 to 10, and high-alcohol wines are rare.
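One rough way to check the direction of skew without the plots is to compare each variable's mean with its median: a mean noticeably above the median suggests a longer right tail. The sketch below does this with the values printed by `summary()` above. The relative-gap measure and the name `rel_gap` are assumptions chosen for illustration, not a formal skewness statistic.

```r
# Median and mean per variable, copied from the summary() output above.
stats <- data.frame(
  variable = c("fixed.acidity", "volatile.acidity", "citric.acid",
               "residual.sugar", "chlorides", "free.sulfur.dioxide",
               "total.sulfur.dioxide", "density", "pH", "sulphates", "alcohol"),
  median = c(7.90, 0.5200, 0.260, 2.200, 0.07900, 14.00,
             38.00, 0.9968, 3.310, 0.6200, 10.20),
  mean   = c(8.32, 0.5278, 0.271, 2.539, 0.08747, 15.87,
             46.47, 0.9967, 3.311, 0.6581, 10.42)
)

# Gap between mean and median as a fraction of the median.
# Positive values point to a longer right tail; values near 0 to symmetry.
stats$rel_gap <- (stats$mean - stats$median) / stats$median

# Variables ordered from most to least right-skewed by this rough measure.
stats[order(-stats$rel_gap), c("variable", "rel_gap")]
```

By this measure density and pH come out close to symmetric, which agrees with their near-normal histograms.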
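These 12 univariate plots give a basic picture of the data.
1 What is the structure of the data? This tidy dataset contains 1,599 red wines and 11 variables describing their chemical properties. At least 3 wine experts rated the quality of each wine (the quality variable) on a scale from 0 (very bad) to 10 (very good). The median rating is 6.
2 What are the features of greatest interest in the dataset? The main features are alcohol content and the quality rating. I want to determine which features are best for predicting red wine quality. I suspect that alcohol content, combined with some of the other variables, can be used to build a predictive model of quality.
3 Which other features in the dataset do you think will help your investigation of the features of interest? Alcohol content, citric acid and sulphates may all affect red wine quality.
4 Did you create any new variables from the existing ones? No. With the dataset as given, I could not think of a new variable that would benefit the analysis.
5 Of the features you investigated, were there any unusual distributions? Did you tidy the data or change its form, and if so, why? Most of the distributions are fairly close to normal, so I did no further tidying.
This plot shows the pairwise relationships between all variables in the dataset, together with each variable's own distribution curve. My main interest is how each variable relates to the quality rating. The variables with a visibly positive relationship to quality are alcohol content, sulphates and citric acid.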
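The code behind the pairwise plot is not shown. As a hedged sketch, base R's `pairs()` draws a scatterplot matrix and `cor()` gives the matching correlation matrix. The data frame below holds only the ten observations printed by `str()` earlier, restricted to four columns, and the name `win_head` is an assumption; ten rows are far too few to reproduce the relationships seen in the full data.

```r
# Ten observations copied from the str() output (illustrative subset only).
win_head <- data.frame(
  alcohol     = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  sulphates   = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  citric.acid = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  quality     = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Scatterplot matrix of every pair of columns.
pairs(win_head)

# Pearson correlation matrix; the "quality" column is the one of interest.
cor_mat <- cor(win_head)
round(cor_mat[, "quality"], 2)
```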
##
## Pearson's product-moment correlation
##
## data: win$fixed.acidity and win$quality
## t = 4.996, df = 1597, p-value = 6.496e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.07548957 0.17202667
## sample estimates:
## cor
## 0.1240516
##
## Pearson's product-moment correlation
##
## data: win$volatile.acidity and win$quality
## t = -16.954, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4313210 -0.3482032
## sample estimates:
## cor
## -0.3905578
##
## Pearson's product-moment correlation
##
## data: win$alcohol and win$quality
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4373540 0.5132081
## sample estimates:
## cor
## 0.4761663
##
## Pearson's product-moment correlation
##
## data: win$sulphates and win$quality
## t = 10.38, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2049011 0.2967610
## sample estimates:
## cor
## 0.2513971
##
## Pearson's product-moment correlation
##
## data: win$pH and win$quality
## t = -2.3109, df = 1597, p-value = 0.02096
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.106451268 -0.008734972
## sample estimates:
## cor
## -0.05773139
##
## Pearson's product-moment correlation
##
## data: win$citric.acid and win$quality
## t = 9.2875, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1793415 0.2723711
## sample estimates:
## cor
## 0.2263725
##
## Pearson's product-moment correlation
##
## data: win$density and win$quality
## t = -7.0997, df = 1597, p-value = 1.875e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2220365 -0.1269870
## sample estimates:
## cor
## -0.1749192
##
## Pearson's product-moment correlation
##
## data: win$total.sulfur.dioxide and win$quality
## t = -7.5271, df = 1597, p-value = 8.622e-14
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2320162 -0.1373252
## sample estimates:
## cor
## -0.1851003
##
## Pearson's product-moment correlation
##
## data: win$free.sulfur.dioxide and win$quality
## t = -2.0269, df = 1597, p-value = 0.04283
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.099430290 -0.001638987
## sample estimates:
## cor
## -0.05065606
##
## Pearson's product-moment correlation
##
## data: win$chlorides and win$quality
## t = -5.1948, df = 1597, p-value = 2.313e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.17681041 -0.08039344
## sample estimates:
## cor
## -0.1289066
##
## Pearson's product-moment correlation
##
## data: win$residual.sugar and win$quality
## t = 0.5488, df = 1597, p-value = 0.5832
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.03531327 0.06271056
## sample estimates:
## cor
## 0.01373164
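The t statistics in the output above follow directly from the correlation estimates: for a Pearson test, t = r * sqrt(df) / sqrt(1 - r^2) with df = n - 2. The sketch below recomputes a few of them from the printed values. The helper name `t_from_r` is an assumption introduced for illustration, and the last lines check the same identity against `cor.test()` on the ten observations printed by `str()` earlier.

```r
# t statistic of a Pearson correlation test, from the estimate r and df = n - 2.
t_from_r <- function(r, df) r * sqrt(df) / sqrt(1 - r^2)

df_full <- 1597  # 1599 observations minus 2

# Recompute t for two of the tests reported above.
t_alcohol  <- t_from_r(0.4761663, df_full)
t_volatile <- t_from_r(-0.3905578, df_full)

# Two-sided p-value for residual.sugar, from its reported estimate.
p_resid_sugar <- 2 * pt(-abs(t_from_r(0.01373164, df_full)), df_full)

# Same identity checked against cor.test() on ten rows from str().
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
ct <- cor.test(alcohol, quality)
all.equal(unname(ct$statistic),
          t_from_r(unname(ct$estimate), unname(ct$parameter)))
```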
We can see that the lower-left end of the linear fit should sit somewhat lower.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0037
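1 Talk about some of the relationships you observed in this part of the investigation. How do the features vary with one another? The quality rating is correlated with alcohol content: to some extent, the higher the alcohol content, the better the rating. The correlation is only moderate, however, at r = 0.476. Other features could be brought into a model to account for the remaining differences in quality rating.
2 Besides alcohol content, did you see any other interesting relationships? residual.sugar shows essentially no linear relationship with quality (r = 0.014, p = 0.58), and chlorides is only weakly negatively correlated with it (r = -0.129).
3 What was the strongest relationship you found? The relationship between quality rating and alcohol content is positive and is the strongest of those examined here.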
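To make the comparison across variables easier, the correlation estimates reported by the tests above can be collected and ranked by absolute size. This is only a convenience sketch; the vector name `r_quality` is an assumption, and the numbers are copied from the `cor.test()` output.

```r
# Pearson correlation of each variable with quality, copied from the output above.
r_quality <- c(
  fixed.acidity        =  0.1240516,
  volatile.acidity     = -0.3905578,
  alcohol              =  0.4761663,
  sulphates            =  0.2513971,
  pH                   = -0.05773139,
  citric.acid          =  0.2263725,
  density              = -0.1749192,
  total.sulfur.dioxide = -0.1851003,
  free.sulfur.dioxide  = -0.05065606,
  chlorides            = -0.1289066,
  residual.sugar       =  0.01373164
)

# Rank by strength of linear association, ignoring sign.
ranked <- r_quality[order(abs(r_quality), decreasing = TRUE)]
round(ranked, 3)
```

Ranked this way, alcohol comes first and volatile.acidity second (with a negative sign), while residual.sugar is last.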
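Treating the quality rating as a categorical variable, I use gradient-coloured lines to look at how several variables relate to quality.
### density vs. alcohol vs. quality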
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0037
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
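1 Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other with respect to the feature of interest? sulphates fits this description well: at the same alcohol content, wines with higher sulphates also tend to have higher ratings. The gradient colouring is well suited to seeing this.
2 Were there any interesting or surprising interactions between features? Lower volatile.acidity appears to be better, although the lines do cross at the end. It also turns out that the higher the alcohol content, the lower the volatile.acidity.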
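The plotting code for this section is not shown, so the following is a sketch of the general idea only: convert `quality` to an ordered factor and map its levels onto a colour gradient. It uses base R graphics and the ten observations printed by `str()` earlier. The names `win_head`, `quality.f` and `shades`, and the choice of colours, are assumptions.

```r
# Ten observations copied from the str() output (illustrative subset only).
win_head <- data.frame(
  alcohol   = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  sulphates = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  quality   = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Treat the rating as an ordered categorical variable.
win_head$quality.f <- factor(win_head$quality, ordered = TRUE)

# One colour per rating level, running from light (low) to dark (high).
shades <- colorRampPalette(c("lightblue", "darkblue"))(nlevels(win_head$quality.f))

# Scatter of two predictors, coloured by rating.
plot(win_head$alcohol, win_head$sulphates,
     col = shades[as.integer(win_head$quality.f)], pch = 19,
     xlab = "alcohol", ylab = "sulphates")
legend("topleft", legend = levels(win_head$quality.f),
       col = shades, pch = 19, title = "quality")
```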
In this section I pick the three most representative plots from this analysis as a summary.
Plot one:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
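Plot one examines the outcome variable. Most ratings are 5 or 6, which sets the baseline for the whole dataset. The ratings are essentially all in the middle of the scale: none are below 3 or above 8, and the higher and lower scores make up only a small share.
Plot two:
Plot two examines the positive relationship between alcohol content and rating, using a box plot and a scatter plot to analyse the two variables. As the quality rating rises, alcohol content rises as well.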
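A hedged sketch of the box-plot view described here, using base R rather than whatever plotting code the report actually used. The data are the ten `alcohol` and `quality` values printed by `str()` earlier, so the output only illustrates the mechanics; ten rows are too few to show the trend seen in the full data. The name `med_by_quality` is an assumption.

```r
# Ten observations copied from the str() output (illustrative subset only).
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# One box of alcohol values per quality score.
boxplot(alcohol ~ factor(quality), xlab = "quality", ylab = "alcohol")

# Median alcohol within each quality score, the centre line of each box.
med_by_quality <- tapply(alcohol, quality, median)
med_by_quality
```

On the full data the same calls would use `win$alcohol` and `win$quality`.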
Plot three:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
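Plot three shows how the other variables, with sulphates as the representative, relate to alcohol and to the rating. Besides alcohol content, sulphates also clearly shows the quality rating rising as its level rises.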
##
## Calls:
## m1: lm(formula = I(quality) ~ I(alcohol), data = win)
## m2: lm(formula = I(quality) ~ I(alcohol) + alcohol, data = win)
## m3: lm(formula = I(quality) ~ I(alcohol) + alcohol + pH, data = win)
## m4: lm(formula = I(quality) ~ I(alcohol) + alcohol + pH + density,
## data = win)
## m5: lm(formula = I(quality) ~ I(alcohol) + alcohol + pH + density +
## sulphates, data = win)
## m6: lm(formula = I(quality) ~ I(alcohol) + alcohol + pH + density +
## sulphates + fixed.acidity, data = win)
## m7: lm(formula = I(quality) ~ I(alcohol) + alcohol + pH + density +
## sulphates + fixed.acidity + residual.sugar, data = win)
## m8: lm(formula = I(quality) ~ I(alcohol) + alcohol + pH + density +
## sulphates + fixed.acidity + residual.sugar + citric.acid,
## data = win)
## m9: lm(formula = I(quality) ~ I(alcohol) + alcohol + pH + density +
## sulphates + fixed.acidity + residual.sugar + citric.acid +
## volatile.acidity, data = win)
## m10: lm(formula = I(quality) ~ I(alcohol) + alcohol + pH + density +
## sulphates + fixed.acidity + residual.sugar + citric.acid +
## volatile.acidity + chlorides, data = win)
## m11: lm(formula = I(quality) ~ I(alcohol) + alcohol + pH + density +
## sulphates + fixed.acidity + residual.sugar + citric.acid +
## volatile.acidity + chlorides + free.sulfur.dioxide, data = win)
## m12: lm(formula = I(quality) ~ I(alcohol) + alcohol + pH + density +
## sulphates + fixed.acidity + residual.sugar + citric.acid +
## volatile.acidity + chlorides + free.sulfur.dioxide + total.sulfur.dioxide,
## data = win)
##
## =====================================================================================================================================================================
## m1 m2 m3 m4 m5 m6 m7 m8 m9 m10 m11 m12
## ---------------------------------------------------------------------------------------------------------------------------------------------------------------------
## (Intercept) 1.875*** 1.875*** 4.426*** -9.593 5.459 60.572*** 77.428*** 75.670*** 33.834 27.461 28.751 21.965
## (0.175) (0.175) (0.387) (11.293) (11.213) (17.293) (21.762) (21.756) (21.270) (21.247) (21.267) (21.195)
## I(alcohol) 0.361*** 0.361*** 0.386*** 0.397*** 0.365*** 0.307*** 0.289*** 0.283*** 0.298*** 0.290*** 0.286*** 0.276***
## (0.017) (0.017) (0.017) (0.019) (0.019) (0.023) (0.027) (0.027) (0.026) (0.026) (0.027) (0.026)
## pH -0.850*** -0.808*** -0.641*** -0.091 -0.002 0.050 -0.099 -0.250 -0.247 -0.414*
## (0.116) (0.121) (0.120) (0.178) (0.191) (0.192) (0.186) (0.189) (0.189) (0.192)
## density 13.811 -2.088 -59.385*** -76.597*** -74.830*** -31.379 -24.284 -25.516 -17.881
## (11.119) (11.061) (17.602) (22.177) (22.171) (21.688) (21.678) (21.695) (21.633)
## sulphates 0.872*** 0.930*** 0.956*** 0.903*** 0.734*** 0.885*** 0.899*** 0.916***
## (0.106) (0.106) (0.108) (0.111) (0.108) (0.114) (0.115) (0.114)
## fixed.acidity 0.090*** 0.105*** 0.088*** 0.074** 0.054* 0.052* 0.025
## (0.022) (0.025) (0.026) (0.025) (0.025) (0.025) (0.026)
## residual.sugar 0.020 0.017 0.011 0.010 0.014 0.016
## (0.015) (0.015) (0.015) (0.015) (0.015) (0.015)
## citric.acid 0.263* -0.489*** -0.361* -0.357* -0.183
## (0.127) (0.139) (0.143) (0.143) (0.147)
## volatile.acidity -1.305*** -1.195*** -1.200*** -1.084***
## (0.116) (0.119) (0.119) (0.121)
## chlorides -1.586*** -1.611*** -1.874***
## (0.417) (0.418) (0.419)
## free.sulfur.dioxide -0.002 0.004*
## (0.002) (0.002)
## total.sulfur.dioxide -0.003***
## (0.001)
## ---------------------------------------------------------------------------------------------------------------------------------------------------------------------
## R-squared 0.2 0.2 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.4 0.4 0.4
## adj. R-squared 0.2 0.2 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.4
## sigma 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.6
## F 468.3 468.3 268.9 179.8 157.6 130.8 109.3 94.5 105.1 95.8 86.4 81.3
## p 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
## Log-likelihood -1721.1 -1721.1 -1694.5 -1693.7 -1660.3 -1651.6 -1650.8 -1648.6 -1587.2 -1580.0 -1579.2 -1569.1
## Deviance 805.9 805.9 779.5 778.8 746.9 738.8 738.1 736.1 681.7 675.5 674.8 666.4
## AIC 3448.1 3448.1 3396.9 3397.4 3332.6 3317.2 3317.6 3315.3 3194.5 3182.0 3182.4 3164.3
## BIC 3464.2 3464.2 3418.4 3424.3 3364.8 3354.8 3360.6 3363.7 3248.3 3241.2 3246.9 3234.2
## N 1599 1599 1599 1599 1599 1599 1599 1599 1599 1599 1599 1599
## =====================================================================================================================================================================
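A few quantities in the table can be checked by hand, and the sketch below does so under stated assumptions. The m1 and m2 columns are identical, which is what one would expect if the extra `alcohol` term in m2 simply repeats `I(alcohol)`; the code shows that behaviour on the ten rows printed by `str()` earlier (the names `win_head` and `m_dup` are assumptions). It also checks that R-squared in a one-predictor model equals the squared correlation, and that AIC and BIC for m1 follow from its log-likelihood with k = 3 estimated parameters (intercept, slope and residual variance).

```r
# Ten observations copied from the str() output (illustrative subset only).
win_head <- data.frame(
  alcohol   = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  sulphates = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  quality   = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# One-predictor model, then a larger model with one more predictor.
m1 <- lm(quality ~ alcohol, data = win_head)
m2 <- lm(quality ~ alcohol + sulphates, data = win_head)

# Repeating the same predictor adds nothing: the second copy is aliased
# and its coefficient is returned as NA.
m_dup <- lm(quality ~ I(alcohol) + alcohol, data = win_head)
coef(m_dup)

# With a single predictor, R-squared equals the squared Pearson correlation.
r2_m1 <- summary(m1)$r.squared
r2_from_cor <- cor(win_head$alcohol, win_head$quality)^2

# AIC and BIC for the table's m1, from its reported log-likelihood and n.
loglik_m1 <- -1721.1
k <- 3
aic_m1 <- -2 * loglik_m1 + 2 * k
bic_m1 <- -2 * loglik_m1 + log(1599) * k
```

For the full data, the squared correlation 0.4761663^2 is about 0.227, which agrees with the R-squared of 0.2 shown for m1 after rounding.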
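Working through the whole project was satisfying. I started out not knowing where to begin, then worked step by step from one variable to two and then to many, and from a single angle to several, until I had a more rounded understanding of the dataset.
Earlier I grouped the plots by variables other than alcohol content, which made them hard to read. Grouping directly by quality rating turned out to be much clearer.
The aim of studying this dataset was to find which components of a red wine earn it a higher rating. The data describe only the wine's chemical composition, yet it is well known that origin, brand and vintage have a large effect on the wine itself. In future work I would hope to add that information.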